The ExpressionSet class (ex_sc) is a convenient data structure that contains 3 dataframes. These dataframes contain expression data, cell information, and gene information respectively.
exprs(ex_sc) is the expression data, where rows are genes and columns are cells
pData(ex_sc) is cell information, where rows are cells and columns are metadata
fData(ex_sc) is gene information, where rows are genes and columns are metadata
ex_sc <- construct_ex_sc(mDC_UMI_Table) # sc_dat == Input expression matrix
ex_sc # Note that phenoData and featureData are empty right now!
ExpressionSet (storageMode: lockedEnvironment)
assayData: 11584 features, 3814 samples
element names: exprs
protocolData: none
phenoData: none
featureData: none
experimentData: use 'experimentData(object)'
Annotation:
ex_sc_skin
ExpressionSet (storageMode: lockedEnvironment)
assayData: 14757 features, 3698 samples
element names: exprs
protocolData: none
phenoData
sampleNames: CB22_H1_ATAACAGGATCAGGGA CB22_H2_AAGCTCCTCTCACATC ...
VB65_NL_CTGGGTATAAACACGG (3698 total)
varLabels: Patient Skin
varMetadata: labelDescription
featureData
featureNames: A1BG A1CF ... ZZZ3 (14757 total)
fvarLabels: Gene Coding
fvarMetadata: labelDescription
experimentData: use 'experimentData(object)'
Annotation:
exprs(ex_sc)[1:5,1:5]
0hrA_TGACGGACAAGTAATC 0hrA_CACAACAGTAGCCTCG 0hrA_GTTTGTTTGCACCTCT
0610007P14Rik 0 0 0
0610009B22Rik 0 0 0
0610009O20Rik 0 0 0
0610010B08Rik 0 0 0
0610010F05Rik 0 0 0
0hrA_GCTTACCTTGACCCTC 0hrA_GGAGAAGCGCTTTGGC
0610007P14Rik 0 0
0610009B22Rik 0 0
0610009O20Rik 0 0
0610010B08Rik 0 0
0610010F05Rik 0 0
pData(ex_sc)[1:5,]
data frame with 0 columns and 5 rows
fData(ex_sc)[1:5,]
data frame with 0 columns and 5 rows
rm(mDC_UMI_Table)
Often we have metadata about the experiment that can be valuable in the analysis! Writing that information now may be appropriate. Our experiment consists of a time course with LPS stimulation.
pData(ex_sc)$Timepoint <- NA # initialize a new pData column
pData(ex_sc)[grep("0hr", rownames(pData(ex_sc))),"Timepoint"] <- "0hr"
pData(ex_sc)[grep("1hr", rownames(pData(ex_sc))),"Timepoint"] <- "1hr"
pData(ex_sc)[grep("4hr", rownames(pData(ex_sc))),"Timepoint"] <- "4hr"
head(pData(ex_sc))
Timepoint
0hrA_TGACGGACAAGTAATC 0hr
0hrA_CACAACAGTAGCCTCG 0hr
0hrA_GTTTGTTTGCACCTCT 0hr
0hrA_GCTTACCTTGACCCTC 0hr
0hrA_GGAGAAGCGCTTTGGC 0hr
0hrA_AAATCAGAGATCTCGG 0hr
The first step is to filter your data to remove low quality cells. Often creating a histogram of the values and assigning cutoffs is simple and effective. Typically we remove all cells lower than 500-1000 UMIs / cell.
ex_sc <- calc_libsize(ex_sc, suffix = "raw") # sums counts for each cell
ex_sc <- pre_filter(ex_sc, minUMI = 1000, maxUMI = 10000, threshold = 1, minCells = 10, print_progress = TRUE) # filters cells and genes
[1] "Filtering Genes"
[1] "Filtering Low Cells"
ex_sc <- calc_libsize(ex_sc, suffix = "raw_filtered")
plot_density(ex_sc, title = "UMI Density", val = "UMI_sum_raw_filtered", statistic = "mean")
Dimensionality reduction is necessary in order to bring the cells from a high dimensional gene expression space (~10k dimensions) down to a more reasonable number. Typically this is done first with PCA ~5-15 dimensions, before a final embedding is done using tSNE or UMAP to bring it down to 2 dimensions.
gene_subset <- subset_genes(ex_sc, method = "PCA", threshold = 1, minCells = 20, nComp = 10, cutoff = 0.85)
ex_sc <- dim_reduce(ex_sc, genelist = gene_subset, pre_reduce = "iPCA", nComp = 10, tSNE_perp = 30, iterations = 500, print_progress=TRUE)
[1] "Starting iPCA"
[1] "Starting tSNE"
Read the 3812 x 10 data matrix successfully!
Using no_dims = 2, perplexity = 30.000000, and theta = 0.500000
Computing input similarities...
Building tree...
Done in 0.58 seconds (sparsity = 0.033243)!
Learning embedding...
Iteration 50: error is 85.934926 (50 iterations in 0.86 seconds)
Iteration 100: error is 80.296177 (50 iterations in 0.81 seconds)
Iteration 150: error is 80.013512 (50 iterations in 0.77 seconds)
Iteration 200: error is 79.974289 (50 iterations in 0.79 seconds)
Iteration 250: error is 79.965020 (50 iterations in 0.77 seconds)
Iteration 300: error is 2.469057 (50 iterations in 0.60 seconds)
Iteration 350: error is 2.128633 (50 iterations in 0.57 seconds)
Iteration 400: error is 1.964351 (50 iterations in 0.56 seconds)
Iteration 450: error is 1.867714 (50 iterations in 0.57 seconds)
Iteration 500: error is 1.805818 (50 iterations in 0.57 seconds)
Fitting performed in 6.87 seconds.
colnames(pData(ex_sc))
[1] "Timepoint" "UMI_sum_raw" "UMI_sum_raw_filtered"
[4] "x" "y" "iPC_Comp1"
[7] "iPC_Comp2" "iPC_Comp3" "iPC_Comp4"
[10] "iPC_Comp5" "iPC_Comp6" "iPC_Comp7"
[13] "iPC_Comp8" "iPC_Comp9" "iPC_Comp10"
plot_tsne_metadata(ex_sc, color_by = "Timepoint")
plot_tsne_metadata(ex_sc, color_by = "iPC_Comp1", title = "PC1 cell loadings")
plot_tsne_metadata(ex_sc, color_by = "iPC_Comp2", title = "PC2 cell loadings")
plot_tsne_metadata(ex_sc, color_by = "iPC_Comp3", title = "PC3 cell loadings")
Now let us try dimension for the skin data!! First we can inspect the skin data to get a sense of what we are working with. Once we get a sense for the data we can then use the same functions as above to perform gene selection and dimension reduction.
dim(ex_sc_skin)
Features Samples
14757 3698
colnames(pData(ex_sc_skin))
[1] "Patient" "Skin"
colnames(fData(ex_sc_skin))
[1] "Gene" "Coding"
plyr::count(pData(ex_sc_skin))
Patient Skin freq
1 CB17 1-Healthy 53
2 CB19 1-Healthy 115
3 CB20 1-Healthy 153
4 CB21 1-Healthy 107
5 CB22 1-Healthy 602
6 CB23 1-Healthy 13
7 CB27 1-Healthy 235
8 VB114 2-NonLesional 27
9 VB114 3-Disease 14
10 VB126 2-NonLesional 25
11 VB126 3-Disease 452
12 VB130 2-NonLesional 48
13 VB130 3-Disease 203
14 VB131 2-NonLesional 35
15 VB131 3-Disease 7
16 VB132 2-NonLesional 24
17 VB132 3-Disease 20
18 VB65 2-NonLesional 134
19 VB65 3-Disease 234
20 VB66 2-NonLesional 46
21 VB66 3-Disease 92
22 VB75 2-NonLesional 17
23 VB75 3-Disease 13
24 VB76 2-NonLesional 153
25 VB76 3-Disease 80
26 VB77 2-NonLesional 180
27 VB77 3-Disease 449
28 VB85 2-NonLesional 19
29 VB85 3-Disease 148
# Use the subset_genes function to find variable genes in the skin data. Be sure to provide the input argument and method argument. Use ?subset_genes() to view help pages for functions. Try different methods and compare the number of genes you get out for each method.
gene_subset <- subset_genes(ex_sc_skin, method = "PCA", threshold = 1, minCells = 30)
# Use the dim_reduce function to create a 2D representation of the skin data. Be sure to provide the input argument and a gene list.
ex_sc_skin <- dim_reduce(ex_sc_skin, genelist = gene_subset, pre_reduce = "iPCA", nComp = 10, iterations = 500)
[1] "Starting iPCA"
[1] "Starting tSNE"
Read the 3698 x 10 data matrix successfully!
Using no_dims = 2, perplexity = 30.000000, and theta = 0.500000
Computing input similarities...
Building tree...
Done in 0.56 seconds (sparsity = 0.033051)!
Learning embedding...
Iteration 50: error is 85.439689 (50 iterations in 1.33 seconds)
Iteration 100: error is 76.362799 (50 iterations in 1.10 seconds)
Iteration 150: error is 75.479012 (50 iterations in 0.67 seconds)
Iteration 200: error is 75.264499 (50 iterations in 0.65 seconds)
Iteration 250: error is 75.183824 (50 iterations in 0.62 seconds)
Iteration 300: error is 2.357471 (50 iterations in 0.50 seconds)
Iteration 350: error is 2.036250 (50 iterations in 0.49 seconds)
Iteration 400: error is 1.883499 (50 iterations in 0.50 seconds)
Iteration 450: error is 1.798063 (50 iterations in 0.50 seconds)
Iteration 500: error is 1.743551 (50 iterations in 0.52 seconds)
Fitting performed in 6.87 seconds.
# Now you can plot metadata from pData() or genes of interest, onto the tSNE mapping.
plot_tsne_metadata(ex_sc_skin, color_by = "Patient", facet_by = "Skin")
Now that we have dimension reduced data we can try clustering it! For dimensions, both Comp and 2d are supported. There will determine if the clustering is done on principal components, or on the 2D representation. There are also 2 clustering algorithms available, density and spectral. Typically we recommend spectral clustering on PCA components, or density clustering on the 2d representation. Try both!
ex_sc <- cluster_sc(ex_sc, dimension = "Comp", method = "spectral", num_clust = 4)
ex_sc$cluster_spectral <- ex_sc$Cluster
ex_sc <- cluster_sc(ex_sc, dimension = "2d", method = "density", num_clust = 4)
Distance cutoff calculated to 0.05219639
ex_sc$cluster_density <- ex_sc$Cluster
plot_tsne_metadata(ex_sc, color_by = "cluster_spectral")
plot_tsne_metadata(ex_sc, color_by = "cluster_density")
Now let us try clustering for the skin data!! Try both density based and spectral clustering!
ex_sc_skin <- cluster_sc(ex_sc_skin, dimension = "Comp", method = "spectral", num_clust = 4)
plot_tsne_metadata(ex_sc_skin, color_by = "Cluster")
There are many possible ways to identify cell types based on their gene expression. The id_markers function will identify genes that are highly expressed in a high proportion of a given cluster, relative to the other clusters.
ex_sc <- id_markers(ex_sc, print_progress = TRUE) # This is a quick method to find good markers genes for cell identification. These gene scores get written to fData()
[1] "Finding markers based on fraction expressing"
[1] "Finding markers based on mean expressing"
[1] "Merging Lists"
ex_sc <- calc_agg_bulk(ex_sc, aggregate_by = "Cluster")
markers <- return_markers(ex_sc, num_markers = 15)
plot_heatmap(ex_sc, genes = unique(unlist(markers)), type = "bulk")
[[1]]
[[2]]
Call:
hclust(d = d, method = "complete")
Cluster method : complete
Distance : euclidean
Number of objects: 60
Now try to identify the cell types in the skin data!
# ex_sc_skin <- id_markers()
# markers <- return_markers()
# ex_sc_skin <- calc_agg_bulk()
# plot_heatmap()
Now that the data has preliminary clusters, we can normalize. SCRAN normalization will first normalize internally in clusters, before normalizing across clusters. Once the data is normalized we can run the same steps as above before visualization. The first step is to select the genes to be used for normalization. One method would be to first only use genes expressed in more than n cells, and then remove the most variable genes. This method can be computationally expensive, and is currently commented out. A simpler approach, counts per million, is also provided below.
table(pData(ex_sc)$Cluster)
Cluster1 Cluster2 Cluster3 Cluster4
942 909 1624 337
# ex_sc_norm <- norm_sc(ex_sc, pool_sizes = c(20,25,30,35,40))
x <- exprs(ex_sc)
cSum <- apply(x,2,sum) # recompute for remaining cells
x <- as.matrix(sweep(x,2,cSum,FUN='/'))*1e6 # normalize to UMIs per million
ex_sc_norm <- construct_ex_sc(x)
pData(ex_sc_norm) <- pData(ex_sc)
Now that we have normalized, it is time to reprocess the data as before, this time on the normalized counts! Use the ex_sc_norm normalized counts from above to run through the same processing steps as before.
# gene_subset <- subset_genes()
# ex_sc_norm <- dim_reduce()
# ex_sc_norm <- cluster_sc()
# plot_tsne_metadata()
# plot_tsne_gene()
From the above analysis, it is clear that some clusters are formed based on their cell type, while others are based on their experimental condition. In these cases it can be useful to incorporate prior information in order to obtain clusters and annotations that are grounded in biological significance. Below, we can assign “panels” similar to flow cytometry, that will enable cell groupings based on the expression of genes that you believe to signify biologically relevenant cell types.
panel1 <- c("S100a9", "Lcn2") # Neutrophil Markers
panel2 <- c("Ccr7", "Fscn1") # DC
panel3 <- c("Csf1r", "Mertk") # Mac
panels <- list(panel1, panel2, panel3)
plot_tsne_gene(ex_sc_norm, gene = unlist(panels), title = "", log_scale = T)
names(panels) <- c("Neutrophil", "Dendritic", "Macrophage")
ex_sc_norm <- flow_filter(ex_sc_norm, panels = panels, title = "Flow Pass Cells")
ex_sc_norm <- flow_svm(ex_sc_norm, pcnames = "Comp")
Loading required package: lattice
plot_tsne_metadata(ex_sc_norm, color_by = "cluster_spectral", title = "Spectral Cluster on PCA components")
plot_tsne_metadata(ex_sc_norm, color_by = "SVM_Classify", title = "Spectral Cluster on PCA components")
Now that cells are grouped by their cell type, we can run DE in order to determine which genes are change in association with our experimental conditions.
For simplicity we can subset to 0hr and 4hr, to find the genes that change between these conditions.
It should be noted that DE should always be run on raw counts, not on the normalized counts!
ex_sc_norm_0_4 <- subset_ex_sc(ex_sc_norm, variable = "Timepoint", select = c("0hr", "4hr"))
[1] "Subsetted Data is 2808 cells"
0hr 4hr
1729 1079
findDEgenes(input = ex_sc,
pd = pData(ex_sc_norm_0_4),
DEgroup = "Timepoint",
contrastID = "0hr",
facet_by = "SVM_Classify",
outdir = "~/Downloads/")
[1] "Performing DE for Dendritic"
Calculating norm factors...
Estimating dispersion...
Doing likelihood ratio fit...
[1] "Performing DE for Macrophage"
Calculating norm factors...
Estimating dispersion...
Doing likelihood ratio fit...
[1] "Performing DE for Neutrophil"
Calculating norm factors...
Estimating dispersion...
Doing likelihood ratio fit...
plot_volcano(de_path = "~/Downloads/", de_file = "Macrophage_0hr_DEresults.tsv", fdr_cut = 0.000001, logfc_cut = 2)
plot_violin(ex_sc_norm_0_4, color_by = "Timepoint", facet_by = "SVM_Classify", gene = "Cxcl9")
plot_violin(ex_sc_norm_0_4, color_by = "Timepoint", facet_by = "SVM_Classify", gene = "Rsad2")
Now try to run DE between the 0hr and 1hr timepoints on the mouse data. Then make a volcano plot of the genes that are significantly changed (FDR < 0.001, logfc_cut >= 2) within Dendritic cells.
# ex_sc_norm_0_1 <- subset_ex_sc()
# findDEgenes()
# plot_volcano()